---
title: "Simple Linear Regression"
subtitle: "Lecture 03"
author: "Chris Slaughter"
footer: "Bios 6312"
date: last-modified
format:
html:
embed-resources: true
standalone: true
number-sections: true
number-depth: 4
anchor-sections: true
smooth-scroll: true
theme: journal
toc: true
toc-depth: 4
toc-title: Contents
toc-location: left
code-link: false
code-tools: true
code-fold: true
code-block-bg: "#f1f3f5"
code-block-border-left: "#31BAE9"
reference-location: margin
fig-cap-location: margin
fontsize: medium
execute:
warning: false
message: false
editor:
markdown:
wrap: 72
---
## General Regression Setting
### Two variable setting
- Many statistical problems examine the association between two
variables
- Outcome variable (response variable, dependent variable)
- Grouping variable (covariate, predictor variable, independent
variable)
- Compare distribution of the outcome variable across levels of the
grouping variable
- Groups are defined by the grouping variable
- Within each group, the grouping variable is constant
- In intro course, statistical analysis is characterized by two
factors
- Number of groups (samples)
- If subjects in groups are independent
- In the two variable setting, statistical analysis is more generally
characterized by the grouping variable. If the grouping variable is
- Constant: One sample problem
- Binary: Two sample problem
- Categorical: $k$ sample problem (e.g. ANOVA)
- Continuous: Infinite sample problem (analyzed with regression)
- Regression thus *extends* the one- and two-sample problems up to
infinite sample problems
- Of course, in reality we never have *infinite* samples, but
models that can handle this case are the ultimate generalization
- Continuous predictors of interest
- Continuous adjustment variables
quarto-executable-code-5450563D
```r
#| label: fig-chol-age
#| fig-cap: "Scatterplot of cholesterol and age. Age is a continuous grouping variable."
### Make a cholesterol and age dataframe. Set the random number seed so everything is reproducible
set.seed(19)
plotdata <- data.frame(age=c(63, 63, rep(65:80,15), rep(81:85,10), 86,86,86, 87,89, 90, 93, 95, 100),
chol=NA)
plotdata$chol <- 190 + .5*plotdata$age + rnorm(length(plotdata$age), 0, 15)
library(ggplot2)
ggplot(plotdata, aes(x=age, y=chol)) + geom_point() + ylab("Cholesterol") + xlab("Age (years)") + theme_bw()
```
### Regression versus two sample approaches
- With a binary grouping variable, regression models reduce to the
corresponding two variable methods
- Linear regression with a binary predictor
- t-test, equal variance: Classic linear regression
- t-test, unequal variance: Linear regression with robust standard
errors (approximately)
- Logistic regression with a binary predictor
- (Pearson) Chi-squared test: Score test from logistic regression
- Proportional odds regression with a binary predictor
- Wilcoxon Rank Sum Test
- Cox (proportional hazards) regression with a binary predictor
- Log-rank test: Score test from Cox regression
### Guiding principle
- Everything can be recast as a regression problem.
## Motivating Problem: Cholesterol and Age
### Definitions
- Is there an association between cholesterol and age?
- Scientific question: Does aging effect cholesterol?
- Statistical question: Does the distribution of cholesterol differ
across age groups?
- Acknowledges variability in the response (cholesterol)
- Acknowledges cause-effect relationship is uncertain
- Association does not imply causation
- Any observed differences could be due to other factors. For
example, the association could be with year of birth rather than
age.
- Continuous response variable: Cholesterol
- Continuous grouping variable (predictor of interest): Age
- An infinite number of ages are possible
- We will not sample every possible age, and may round for
simplicity
### Simple Regression Model
- Attempt to answer scientific question by assessing linear trends in
average cholesterol
- Estimate the best fitting line to average cholesterol within age
groups
$$
E[\textrm{Chol} | \textrm{Age}] = \beta_0 + \beta_1 \times \textrm{Age}
$$
- The expected value of cholesterol given age is modeled using an
intercept ($\beta_0$) and slope ($\beta_1$)
- An association exists if the slope is nonzero
- A non-zero slope indicates that the average cholesterol will be
different across different age groups
#### Descriptive plots of cholesterol and age
quarto-executable-code-5450563D
```r
#| label: fig-chol-age-lowess
#| fig-cap: "Scatterplot of cholesterol and age. Association is described using a lowess smooth line."
ggplot(plotdata, aes(x=age, y=chol)) + geom_point() + ylab("Cholesterol") + xlab("Age (years)") + theme_bw() + geom_smooth()
```
quarto-executable-code-5450563D
```r
#| label: fig-chol-age-linear
#| fig-cap: "Scatterplot of cholesterol and age. Association is described using a linear trend line."
ggplot(plotdata, aes(x=age, y=chol)) + geom_point() + ylab("Cholesterol") + xlab("Age (years)") + theme_bw() + geom_smooth(method="lm")
```
#### Simple linear regression model
- The simple regression model produces an easy to remember (but
approximate) rule of thumb.
- "Normal cholesterol is 190 plus half your age"
- $E[\textrm{Chol} | \textrm{Age}] = 190 + 0.50 \times \textrm{Age}$
- Data were generated using this model (estimates, below, will be
different)
#### Frequentist Estimates and Interpretation
- Raw, unformatted estimates, model summary, and confidence intervals
for age
quarto-executable-code-5450563D
```r
m.chol <- lm(chol ~ age, data=plotdata)
summary(m.chol)
confint.default(m.chol, "age") # Based on asymptotic Normality
```
$E[\textrm{Chol} | \textrm{Age}] = 203.2 + 0.32 \times \textrm{Age}$
- Using finalfit to create a nicer output table of the coefficients
and confidence intervals
quarto-executable-code-5450563D
```r
mykable = function(x){
knitr::kable(x, row.names = FALSE, align = c("l", "l", "r", "r", "r", "r", "r", "r", "r"),
booktabs=TRUE)
}
library(finalfit)
library(dplyr)
library(rms)
explanatory = c("age")
dependent = 'chol'
label(plotdata$chol) <- "Cholesterol"
label(plotdata$age) <- "Age (years)"
plotdata %>%
finalfit(dependent, explanatory) %>% mykable()
```
#### Bayesian Estimates and Interpretation
- Bayesian approach to the linear model requires specifying
- The model, e.g. a linear model with intercept and slope for age,
Normally distributed errors and constant variance
- Prior distributions on parameters
- For the simple linear regression model, we have parameters
$\beta_0$, $\beta_1$, and $\sigma$.
- For now, we will use default prior distributions that are
are intended to be *weakly informative* in that they provide
moderate regularization and help stabilize computation. See
the [STAN
documentation](https://mc-stan.org/rstanarm/reference/priors.html)
for more details
- Appropriate priors can be based on scientific considerations
- Sensitivity analyses can evaluate the the robustness of
finding to different prior assumptions
- Output from Bayesian linear regression
quarto-executable-code-5450563D
```r
library(rstanarm)
library(bayesplot)
fit1 <- stan_glm(chol ~ age,
data=plotdata, family=gaussian(),
seed=1234,
refresh=0)
summary(fit1, digits=2, prob=c(.025, .5, .975))
```
- And a summary of the prior distributions used
quarto-executable-code-5450563D
```r
prior_summary(fit1, digits = 2)
```
- Interpretation
- Slope for age is of primary scientific importance
- *A priori* we assume that no association between age and
cholesterol. Specifically, we assumed a Normal prior with
location (mean) of 0 and scale (standard devation) of 6 for
$\beta_1$.
- Conditional on the data, we estimate that for every 1 year
increase in age, cholesterol increases by 0.32 mg/dl (95%
credible interval 0.06 to 0.58).
#### Comparison of Frequentist and Bayesian interpretation of the slope
- Following was given by gpt-4.0 [^1]
[^1]: Query "Explain how a slope from simple linear regression is
interpreted using language a college graduate could understand.
Provide a Bayesian and Frequentist interpretation."
::: {.classname style="color: gray"}
In simple linear regression, the slope (often denoted as "b" or "beta")
is interpreted as the expected change in the dependent (or response)
variable for a one-unit increase in the independent (or predictor)
variable, assuming all other variables are held constant.
From a Frequentist perspective, the slope is considered a fixed but
unknown value which is estimated from the sample data. If the slope is
found to be 2, for instance, it would suggest that for every one-unit
increase in our predictor variable, we expect, on average, a 2-unit
increase in the response variable. This is based on the data at hand and
is subject to sampling variability. Confidence intervals can be
constructed around this estimate to provide a range of plausible values
for this unknown fixed parameter. If a 95% confidence interval for the
slope is (1.5, 2.5), it would suggest that we are 95% confident that the
true slope falls within this range.
In contrast, a Bayesian approach treats the slope as a random variable.
Rather than a single fixed value, the slope has a distribution of
possible values. This distribution combines our prior beliefs about what
the slope might be (prior distribution) and the observed data
(likelihood). The resulting distribution (posterior distribution)
reflects our updated belief about the slope after seeing the data. If
the mean of the posterior distribution is 2, it suggests that, given the
prior and the data, our best guess is that a one-unit increase in the
predictor variable corresponds to a 2-unit increase in the response
variable on average. A credible interval, say a 95% credible interval of
(1.5, 2.5), suggests that there is a 95% probability that the true value
of the slope falls within this range, given our prior and the observed
data.
:::
### Uses of Regression
- Borrowing information
- Use other groups to make estimates in groups with sparse data
- Intuitively, 67 and 69 year olds would provide some relevant
information about 68 year olds
- Assuming a straight line relationship tells us about other, even
more distant, individuals
- If we do not want to assume a straight line, we may only want to
borrow information from nearby groups
- Locally weighted scatterplot smooth line (lowess) added to the
previous figures
- Splines discussed in future lectures
- May not want to borrow too much information
- Linear relationship is an assumption, with often low power to detect
departures from linearity
- Always avoid extrapolating beyond the range of the data (e.g. ages
under 65 or over 100)
- Defining "Contrasts"
- Define a comparison across groups to use when answering
scientific questions
- If the straight line relationship holds, the slope is the
difference in mean cholesterol levels between groups differing
by 1 year in age
- Do we want to assume that comparisons of 65 to 66 year old
subjects are the same as comparisons of 95 to 96 year old
subjects?
- If a non-linear relationship, the slope is still the average
difference in mean cholesterol levels between groups differing
by 1 year in age
- Slope is a (first order or linear) test for trend
### Linear Regression Inference
- Regression output provides
- Estimates
- Intercept: Estimated mean cholesterol when age is 0
- Slope: Estimated average difference in average cholesterol
for two groups differing by 1 year in age
- Standard errors
- Confidence intervals
- P-values for testing
- Intercept is zero (usually unimportant)
- Slope is zero (test for linear trend in means)
- Interpretation
From linear regression analysis, we estimate that for each year
difference in age, the difference in mean cholesterol is 0.47 mg/dL. A
95% confidence interval (CI) suggests that this observation is not
unusual if the true difference in mean cholesterol per year difference
in age were between 0.21 and 0.73 mg/dL. Because $p < 0.001$, we reject
the null hypothesis that there is no linear trend in the average
cholesterol across age groups using a significance level, $\alpha$, of
$0.05$.
## Simple Linear Regression
### Ingredients
- Response
- The distribution of this variable will be compared across groups
- Linear regression models the mean of the response variable
- Log transformation of the response corresponds to modeling the
geometric mean
- Notation: Is is extremely common to use $Y$ to denote the
response variable when discussing general methods
- Predictor
- Group membership is measured by this variable
- Notation
- When not using mnemonics, will be referred to as the $X$
variable in simple linear regression (linear regression with
one predictor)
- Later, when we discuss multiple regression, will refer to
$X_1, X_2, \ldots, X_p$ when there are up to $p$ predictors
- Regression Model
- We typically consider a "linear predictor function" that is
linear in the modeled predictors
- Expected value (i.e. mean) of $Y$ for a particular value of $X$
- $E[Y | X] = \beta_0 + \beta_1 \times X$
- In a deterministic world, a line is of the form $y = mx + b$
- With no variation in the data, each value of $y$ would like
exactly on a straight line
- Intercept $b$ is values of $y$ when $x = 0$
- Slope $m$ is the difference in $y$ for a one unit difference
in $x$
- Statistics in not completely deterministic. The real world has
variability
- Response within groups is variable (people born on the same
day will have different cholesterol levels!)
- Randomness due to other variables impacting cholesterol
- Inherent randomness
- The regression line thus describes the central tendency of the data
in a scatterplot of the response versus the predictor
- Interpretation of regression parameters
- Intercept $\beta_0$: Mean $Y$ for a group with $X=0$
- Often $\beta_0$ is not of scientific interest
- May be out of the range of data, or even impossible to
observe $X=0$
- Slope $\beta_1$: Difference in mean $Y$ across groups differing
in $X$ by 1 unit
- Usually measures association between $Y$ and $X$
- $E[Y | X] = \beta_0 + \beta_1 \times X$
- Derivation of interpretation
- Simple linear regression of response $Y$ on predictor $X$
- Mean of any arbitrary group can be derived from the $$
Y_i = \beta_0 + \beta_1 X_i
$$
- Interpretation determined by considering possible values of $X$
- Model: $E[Y_i | X_i] = \beta_0 + \beta_1 \times X_i$
- When $X_i = 0$, $E[Y_i | X_i = 0 ] = \beta_0$
- When $X_i = x$, $E[Y_i | X_i = x ] = \beta_0 + \beta_1 x$
- When $X_i = x + 1$,
$E[Y_i | X_i = x + 1 ] = \beta_0 + \beta_1 x + \beta_1$
- We can use the above to get an equation for $\beta_1$
$$
E[Y_i | X_i = x + 1 ] - E[Y_i | X_i = x ] = \\
(\beta_0 + \beta_1 x + \beta_1) - (\beta_0 + \beta_1 x) \\
= \beta_1
$$
- Thus, $\beta_1$ is the expected change in $Y$ per one unit
increase in $x$
### Matrix Notation
- Using scalars, the simple linear regression model can be written as
- $Y_i = \beta_0 + \beta_1 \times X_i + \epsilon_i$
- $i = 1, \ldots, n$
- $i$ indexes the independent sampling units (e.g. subjects)
- $n$ is the total number of independent sampling units
- This formulauation drops the expected value notation, add in
$\epsilon_i$
- $\epsilon_i$ are the "Residuals" or "Errors"
- $E[\epsilon_i] = 0$
- $V[\epsilon_i] = \sigma^2$ (constant variance assumption)
```{=tex}
\newcommand\Ybd{{\boldsymbol{Y}}}
\newcommand\Xbd{{\boldsymbol{X}}}
\newcommand\betabd{{\boldsymbol{\beta}}}
\newcommand\epsilonbd{{\boldsymbol{\epsilon}}}
\newcommand\zerovec{{\boldsymbol{0}}}
\newcommand\onevec{{\boldsymbol{1}}}
\newcommand\eye{{\boldsymbol{I}}}
```
- Matrix notation:
${\boldsymbol{Y}}= {\boldsymbol{X}}{\boldsymbol{\beta}}+ {\boldsymbol{\epsilon}}$
${\boldsymbol{Y}}= \left( \begin{array}{c} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{array} \right)_{n\times1}$
${\boldsymbol{X}}= \left( \begin{array}{cc} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{array} \right)_{n\times2}$
${\boldsymbol{\beta}}= \left( \begin{array}{c} \beta_0 \\ \beta_1 \\ \end{array} \right)_{2\times1}$
${\boldsymbol{\epsilon}}= \left( \begin{array}{c} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{array} \right)_{n\times1}$
$E[{\boldsymbol{\epsilon}}] = {\boldsymbol{0}}$, where
${\boldsymbol{0}}= \left( \begin{array}{c} 0 \\ 0 \\ \vdots \\0 \end{array} \right)_{n\times1}$
$V[{\boldsymbol{\epsilon}}] = \sigma^2 {\boldsymbol{I}}$, where
${\boldsymbol{I}}= \left( \begin{array}{cccc} 1 & 0 & \ldots & 0 \\ 0 & 1 & \ldots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \ldots & 1 \end{array} \right)_{n\times n}$
$E[{\boldsymbol{Y}}] = {\boldsymbol{X}}{\boldsymbol{\beta}}$
- I am using standard notation to indicate matrices/vectors and
scalars
- Boldface indicates a vector or matrix (${\boldsymbol{Y}}$,
${\boldsymbol{X}}$, ${\boldsymbol{\beta}}$,
${\boldsymbol{\epsilon}}$, ${\boldsymbol{0}}$, ${\boldsymbol{I}}$)
- Normal typeface indicates a scalar ($Y_i$, $x_i$, $\beta_0$,
$\beta_1$, $\epsilon_i$, $0$, $1$)
## Example Analysis of BMI (Part 1)
- Example analysis conducted in class involving BMI (response) with
gender (predictor 1) and age (predictor 2)
- This is "Lab 1" and will serve as an example of how future labs will
proceed
## Residuals
- Often linear regression models are specified in terms of the
response instead of the mean response
- Include an error term in the model, $\epsilon_i$
- Model $Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$
- The linear regression model is divided into two parts
- The mean, or systematic, part (the "signal")
- The error, or random, part (the "noise")
- Residuals
- $\hat{\epsilon}_i = Y_i - \left(\hat{\beta_0} + \hat{\beta_1} X_i\right)$
- $\hat{{\boldsymbol{\epsilon}}} = {\boldsymbol{Y}}- {\boldsymbol{X}}\hat{{\boldsymbol{\beta}}}$
- $\hat{{\boldsymbol{\beta}}} = \left({\boldsymbol{X}}'{\boldsymbol{X}}\right)^{-1} \left({\boldsymbol{X}}'{\boldsymbol{Y}}\right)$
- The mean of the residuals is $0$
- The standard deviation of the residuals is the "Root Mean Square
Error"
- In our example analysis of BMI and gender, the RMSE is exactly
equal to the pooled estimate of the standard deviation from a
two-sample, equal variance t-test
- In our example analysis of BMI and age, the RMSE is the square
root of the average variances across the age groups
- In many textbooks, $\epsilon_i \sim N(0, \sigma^2)$
- A common $\sigma^2$ implies constant variance across all levels
of the grouping variable, "homoscedasticity"
- Normality of the residuals is a nice property, but it is not
necessary (and rarely observed in practice)
- We will discuss how lack of Normality and heteroscedasticity
impact statistical inference
## Assumptions and Inference with Linear Regression
### The Big Three Assumptions
- Most common uses of regression
- Prediction: Estimating what a future value of $Y$ will be based
on observed $X$
- Comparisons within groups: Describing the distribution of $Y$
across levels of the grouping variable $X$ by estimating the
mean $E[Y | X]$
- Comparisons across groups: Differences appear across groups if
the regression parameter slope estimate $\beta_1$ is non-zero
- Valid statistical inference (CIs, p-values) about *associations*
requires three general assumptions
- Assumption 1: Approximately Normal distributions for the parameter
estimates
- Normal data or "large" N
- It is often surprising how small "large" can be
- Definition of large depends on the error distribution and
relative sample sizes within each group
- With exactly Normally distributed errors, only need one
observation (or two to estimate a slope)
- With very heavy tails, "large" can be very large
- See Lumley, et al., Ann Rev Pub Hlth, 2002
- Assumption 2: Independence of observations
- Classic regression: Independence of all observation (now)
- Robust standard errors: Correlated observations within
identified clusters (later)
- Assumption 3: Assumption about the variance of observations within
groups
- Classic regression: Homoscedasticity (equal variance across
groups)
- Robust standard errors: Allows for unequal variance across
groups
### Additional Assumptions for Additional Inference
- Note that some textbooks will claim there are more than three
assumptions. In truth, additional assumptions are not needed to make
the aforementioned statistical inference about *associations*.
However \...
- Valid statistical inference (CIs, p-values) about *means responses
in specific groups* requires a further assumption
- Assumption 4: Adequacy of the linear model
- If we are trying to borrow information about the mean from
neighboring groups, and we are assuming a straight line
relationship, the straight line needs to be true
- No longer saying there is just a linear trend in the means, but
now need to believe that all the means lie on a straight line
- Note that we can model transformations of the measured predictor
- For inference about *individual observations* (prediction intervals,
P-values) in specific groups requires another assumption
- Assumption 5: Assumptions about the distribution of the errors
within each group (a very strong assumption)
- Classically: Errors have the same Normal distribution within
each grouping variable
- Robust standard error will not help
- Prediction intervals assume a common error distribution across
groups (homoscedasticity)
- Possible extension: Errors have the same distribution, but not
necessarily Normal (rarely implemented in software)
- Bootstrapping
- Bayesian analysis
- Other flexible approaches
#### Scenario 1: All model assumptions hold
quarto-executable-code-5450563D
```r
#| label: fig-assumptions-true
#| fig-cap: Scatterplot with best fit line where assumptions of Normality, linearity, homoskedasticity all hold
set.seed(1234)
n <- 200
regassumptions <- data.frame(x=seq(from=0, to=100, length=n))
# Linear model correct, Normal errors
regassumptions$y1 <- 5 + 0.2*regassumptions$x + rnorm(n,0,5)
# Linear model correct, Skewed errors
regassumptions$y2 <- 5 + 0.2*regassumptions$x + rlnorm(n,0,1)
# Linear model correct, errors increasing with predictor (so increasing with Y too)
regassumptions$y3 <- 5 + 0.2*regassumptions$x + rnorm(n,0,1+regassumptions$x*.1)
# Linear model incorrect, Normal error
regassumptions$y4<- 5 + 0.2*regassumptions$x + 0.02*(regassumptions$x)^2 + rnorm(n,0,5)
ggplot(regassumptions, aes(x=x, y=y1)) + geom_point() + geom_smooth(method="lm")
```
quarto-executable-code-5450563D
```r
#| label: fig-assumptions-true-diag
#| fig-cap: Diagnostic plots where assumptions of Normality, linearity, homoskedasticity all hold
library(gglm) # Load gglm
model.y1 <- lm(y1 ~ x, data = regassumptions)
gglm(model.y1)
```
- Can estimate the mean of Y at a given value of X
- Can make inference about the association between X and Y (based on
the slope)
- Can predict new values of Y given X
#### Scenario 2: Error distribution is not Normally distributed
quarto-executable-code-5450563D
```r
#| label: fig-assumptions-error-distribution
#| fig-cap: Scatterplot with best fit line where the error distribution is skewed to the right
ggplot(regassumptions, aes(x=x, y=y2)) + geom_point() + geom_smooth(method="lm")
```
quarto-executable-code-5450563D
```r
#| label: fig-assumptions-error-distribution-diag
#| fig-cap: Diagnostic plots where the error distribution is skewed to the right
model.y2 <- lm(y2 ~ x, data = regassumptions)
gglm(model.y2)
```
- If the sample size is large enough, we can estimate the mean of Y at
a given value of X (Central Limit Theorem)
- If we use "robust" standard errors, we can make inference about the
association between X and Y (based on the slope)
- Robust standard error relax the assumption of
homoskedasticity/common variance
- We *cannot* predict new values of Y given X. This requires the
correct error distribution.
#### Scenario 3: Variance increasing with predictor and expected value of Y (heteroskedasticity)
quarto-executable-code-5450563D
```r
#| label: fig-assumptions-heteroskedasticity
#| fig-cap: Scatterplot with best fit line where the error distribution variance is increasing
ggplot(regassumptions, aes(x=x, y=y3)) + geom_point() + geom_smooth(method="lm")
```
quarto-executable-code-5450563D
```r
#| label: fig-assumptions-heteroskedasticity-diag
#| fig-cap: Diagnostic plot where the error distribution variance is increasing
model.y3 <- lm(y3 ~ x, data = regassumptions)
gglm(model.y3)
```
- If the sample size is large enough, we can estimate the mean of Y at
a given value of X (Central Limit Theorem)
- If we use "robust" standard errors, we can make inference about the
association between X and Y (based on the slope)
- Robust standard error relax the assumption of
homoskedasticity/common variance
- We *cannot* predict new values of Y given X. This requires the
correct error distribution (and mean model, but that is correct)
#### Scenario 4: Mean model incorrectly specified
quarto-executable-code-5450563D
```r
#| label: fig-assumptions-mean-incorrect
#| fig-cap: Scatterplot with best fit line where the mean model is not correctly specified
ggplot(regassumptions, aes(x=x, y=y4)) + geom_point() + geom_smooth(method="lm")
```
quarto-executable-code-5450563D
```r
#| label: fig-assumptions-mean-incorrect-diag
#| fig-cap: Diagnostic plots where the mean model is not correctly specified
model.y4 <- lm(y4 ~ x, data = regassumptions)
gglm(model.y4)
```
- Because our model is wrong, we *cannot* estimate the mean of Y at a
given value of X. It will be biased, even if the sample size is
large, the model will still be incorrect
- The good news is that we can fix our mean model to more flexibly
model the association.
- Here, we have just included a single slope term to model the
association with $X$, $E[Y|X] = \beta_0 + \beta_1*X$.
- The correct model in this simple example is
$E[Y|X] = \beta_0 + \beta_1*X + \beta_2*X^2$. Note that this
model is still a "linear model". A "linear model" is linear
*in the parameters* ($\beta_0$, $\beta_1$, $beta_2$). We can
included the quadratic *predictor* $X^2$ in the model and
still have a model that is linear in the parameters.
- If we use "robust" standard errors, we can make inference about the
association between X and Y (based on the slope, $\beta_1$)
- Robust standard error relax the assumption of homoskedasticity
- While the model is not a great fit, $\beta_1$ may summarize a
first-order trend
- We *cannot* predict new values of Y given X. This requires the
correct mean model and error distribution.
#### Summary
- Regression based inference about *associations* is far more
trustworthy than estimation of group means or individual
predictions.
- There is much to be gained by using robust variance estimates
::: {#tbl-panel layout-ncol="2"}
| Linearity | Homoskedasticity | Normality |
|-----------|------------------|-----------|
| Y | Y | Y |
| Y | Y | N |
| Y | N | Y |
| Y | N | N |
| N | Y | Y |
| N | Y | N |
| N | N | Y |
| N | N | N |
: Assumption Holds (Yes/No)
+-------------------+-------------------+-------------------+
| $\hat{\beta}$ | $\ | $\ |
| | hat | hat{\textrm{Var } |
| | {\textrm{Var}}_ { | }_R(\hat{\beta})$ |
| | NR}(\hat{\beta})$ | |
+===================+===================+===================+
| Y | Y | Y |
+-------------------+-------------------+-------------------+
| Y | M^2^ | M^2^ |
+-------------------+-------------------+-------------------+
| Y | N | Y |
+-------------------+-------------------+-------------------+
| Y | N | M^2^ |
+-------------------+-------------------+-------------------+
| M^1^ | N | M^3^ |
+-------------------+-------------------+-------------------+
| M^1^ | N | M^2,3^ |
+-------------------+-------------------+-------------------+
| M^1^ | N | M^2^ |
+-------------------+-------------------+-------------------+
| M^1^ | N | M^2,3^ |
+-------------------+-------------------+-------------------+
: Valid Inference (Yes/No/Maybe)
Summary of what is required for valid estimates of $\beta$ and the
variance of beta using the classical, non-robust (NR) and robust (R)
approaches. The classic (non-robust) variance is
$\hat{\textrm{Var}}_{NR}(\hat{\beta})$ and the robust variance is
$\hat{\textrm{Var}}_R(\hat{\beta})$
:::
- ^2^ Approximately valid under large sample sizes
- ^3^ If $X$ is random, but not if $X$ is fixed
### Interpreting "Positive" Results
- Slope is statistically different from 0 using robust standard errors
- Observed data is atypical of a setting with no linear trend in mean
response across groups
- Data suggests evidence of a trend toward larger (or smaller) means
in groups having larger values of the predictor
- (To the extent the data appears linear, estimates of the group means
will be reliable)
### Interpreting "Negative" Results
- Many possible reasons why the slope is not statistically different
from 0 using robust standard errors
- There may be no association between the response and predictor
- There may be an association, but not in the parameter considered
(the mean response)
- There may be an association in the parameter considered, but the
best fitting line has zero slope
- There may be a first order trend in the parameter considered, but we
lacked the precision to be confident that it truly exists (a type II
error)
## Model Checking
- Much statistical literature has been devoted to methods for checking
the assumptions for regression models
- My philosophy: Model checking is generally fraught with peril as it
necessarily involves multiple comparisons
- We cannot reliably use the sampled data to assess whether it
accurately portrays the population
- We are more worried about the data from the population that we
might not have sampled
- It is not so much the abnormal points that we see, but the ones
that are hiding in the population that will make our model
perform badly
- But, do tend to worry more when we see a tendency to extreme
outliers in the sample or clear departures from model
assumptions
- If we over-check our model and make adjustments to the model
based on our observed data, we will inflate the type I error
rate (i.e. will be more likely to claim statistical significance
when it doesn't really exist). We run the risk of creating a
model that fits our data well but does not generalize.
- Estimates are biased away from the null
- Reported standard errors are too small
- If we fish through the data, we will always find significant
results
- In clinical trials, often Phase II results are not able to
be replicated in Phase III trials
- Instead of extensive model checking, go back to our choices of
inference when planning our analysis
- Best to plan for unusual data
- There is often little to be lost and much to be gained by using
the robust standard error estimates
- By using robust errors, avoids much of the need for model
checking
- Model checking is almost entirely data driven
- Robust standard errors is a more logical scientific approach
- Minimize the need to presume more knowledge than the
question we are trying to answer
- E.g., if we don't know how the means might differ, why
should we presume to know how the variances or the shape of
the distribution might behave?
### Residuals versus fitted values
- Plot of $\hat{\epsilon_i}$ versus $\hat{y_i}$
- If assumptions hold, should be a random pattern about zero. See
@fig-assumptions-true-diag for example.
- If you have *a priori* concerns about non-constant variance, this is
one potential check
- E.g., it is fairly common for uncertainty to increase with
outcome. See @fig-assumptions-heteroskedasticity-diag for the
resulting "fan" pattern that would appear
### Scale-location plot
- The scale-location plot is very similar to residuals vs fitted
values, and is used to evaluate the homoskedasticity assumption
- It uses the square root of the absolute value of standardized
residuals instead of plotting the residuals themselves
- We want to check two things
- That the best fit line is approximately horizontal. If it is,
then the average magnitude of the standardized residuals isn't
changing much as a function of the fitted values.
- That the spread around the fit line doesn't vary with the fitted
values. If so, then the variability of magnitudes doesn't vary
much as a function of the fitted values.
### QQ plot
- Used to evaluate Normality of the residuals
- If assumptions hold, points should lie on a 1:1 line
- Plot of standardized residuals versus theoretical quantities from a
N(0,1) distribution
- If assumptions hold, the standardized residuals should be a
random sample from a N(0,1) distribution
- We can use quantiles of the Normal distribution to check how
closely the observed matches the expected
- The 50th percentile of the residuals should be
`r round(qnorm(.5,0,1),2)`
- The 75th percentile of the residuals should be
`r round(qnorm(.75,0,1),2)`
- The 90th percentile of the residuals should be
`r round(qnorm(.90,0,1),2)`
- The 95th percentile of the residuals should be
`r round(qnorm(.95,0,1),2)`
- The 97.5th percentil of the residuals should be
`r round(qnorm(.975,0,1),2)`
- Similar idea would be a histogram or density plot of the
(standardized) residuals that could be visually evaluated for
Normality
### Residuals versus leverage
- Used to check for outliers
- Plot can help us to identify influential observations, if there are
any
- Influence differs from leverage. Not all outliers influence the
regression coefficient estimates
## Linear Regression with Robust Standard Error estimates
- There are several libraries available in R for fitting models with
robust error estimates
- There are also several different flavors of robust estimators
- For now, we are going to consider the default "Huber-White
sandwich estimator"
- In Stata, the Huber-White robust estimate of the standard error
can be obtained using the 'robust' option
``` stata
regress chol age, robust
```
- In the following examples I provide estimates using classical linear
regression and linear regression estimate using robust standard
errors. For each of these example compare
- Estimates of the intercept, $\hat{\beta_0}$
- Estimates of the slope, $\hat{\beta_1}$
- Estimates of the standard errors,
$\hat{\textrm{se}}(\hat{\beta_0})$ and
$\hat{\textrm{se}}(\hat{\beta_1})$
### Comparison when Normality, constant variance holds
#### Age and Cholesterol
quarto-executable-code-5450563D
```r
fit.ols <- ols(chol ~ age, data=plotdata, x=TRUE)
fit.ols
```
quarto-executable-code-5450563D
```r
robcov(fit.ols)
```
#### Scenario 1 dataset
quarto-executable-code-5450563D
```r
ols(y1 ~ x, data=regassumptions, x=TRUE)
robcov(ols(y1 ~ x, data=regassumptions, x=TRUE))
```
### Comparison when variance is increasing with predictor (Scenario 3)
quarto-executable-code-5450563D
```r
ols(y3 ~ x, data=regassumptions, x=TRUE)
robcov(ols(y3 ~ x, data=regassumptions, x=TRUE))
```
### Comparison when error variance is skewed to the right (Scenario 2)
quarto-executable-code-5450563D
```r
ols(y2 ~ x, data=regassumptions, x=TRUE)
robcov(ols(y2 ~ x, data=regassumptions, x=TRUE))
```
### Comments
```{=tex}
\newcommand\seb0{{\hat{\textrm{se}}(\hat{\beta_0})}}
\newcommand\seb1{\hat{\textrm{se}}(\hat{\beta_1})}
```
+-----------+-----------+-----------+-----------+-----------+
| Model | Classical | Robust | Classical | Robust \| |
| | Intercept | Intercept | Slope | Slope |
+===========+===========+===========+===========+===========+
| Age - | | | | |
| Ch | | | | |
| olesterol | | | | |
| | | | | |
| As | | | | |
| sumptions | | | | |
| hold | | | | |
+-----------+-----------+-----------+-----------+-----------+
| Scenario | | | | |
| 1 | | | | |
| | | | | |
| As | | | | |
| sumptions | | | | |
| hold | | | | |
+-----------+-----------+-----------+-----------+-----------+
| Scenario | | | | |
| 3 | | | | |
| | | | | |
| I | | | | |
| ncreasing | | | | |
| variance | | | | |
+-----------+-----------+-----------+-----------+-----------+
| Scenario | | | | |
| 2 | | | | |
| | | | | |
| Skewed | | | | |
| variance | | | | |
+-----------+-----------+-----------+-----------+-----------+
: Comparison of standard error estimates
- Point estimates $\hat{\beta_0}$ and $\hat{\beta_1}$ were the same
for robust and classical standard error
- Point estimates $\hat{\textrm{se}}(\hat{\beta_0})$ and
$\hat{\textrm{se}}(\hat{\beta_1})$ were different for robust and
classical standard error
- In general, it is hard to predict the effect of using robust
standard errors
- These were just a a couple of examples and do not constitute a
thorough examination of the impact of using robust standard
errors
- When the model assumptions hold, for these two scenarios, robust
standard error estimates were larger than classical standard error
estimates
- While this is not unexpected, to formally evaluate the amount of
increase would require a statistical simulation
- Simulate many datasets where the Normality and constant
variance assumption holds
- For each dataset, estimate
$\hat{\textrm{se}}(\hat{\beta_0})$ and
$\hat{\textrm{se}}(\hat{\beta_1})$
- Summarize the results
- Approach will be covered in a lab
## Linear Regression and Correlation
### Pearson's Correlation
- Notation
- $\rho$ signifies the population value
- $r$ (or $\hat{\rho}$) is the estimated correlation from data
- Formula
- $r = \frac{\Sigma(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\Sigma(x_i - \bar{x})^2\Sigma(y_i - \bar{y})^2}}$
- Range: $-1 \leq r \leq 1$
- Interpretation
- Measures the linear relationship between $X$ and $Y$
- Correlation coefficient is a unitless index of strength of
association between two variables (+ = positive association, - =
negative, 0 = no association)
- Can test for significant association by testing whether the
population correlation is zero t = which is identical to the
$t$-test used to test whether the population $r$ is zero;
$\textrm{d.f.} = n-2$
- Use probability calculator for $t$ distribution to get a
2-tailed $P$-value
- Confidence intervals for population $r$ calculated using
Fisher's $Z$ transformation
$$Z = \frac{1}{2} \textrm{log}_\textrm{e} \left( \frac{1+r}{1-r} \right)$$
- For large $n$, $Z$ follows a Normal distribution with standard
error $\frac{1}{\sqrt{n-3}}$
- To calculate a confidence interval for $r$, first find the
confidence interval for $Z$ then transform back to the $r$ scale
$$\begin{aligned}
Z & = & \frac{1}{2} \textrm{log}_\textrm{e} \left( \frac{1+r}{1-r} \right) \\
2*Z & = & \textrm{log}_\textrm{e} \left( \frac{1+r}{1-r} \right) \\
\textrm{exp}(2*Z) & = & \left( \frac{1+r}{1-r} \right) \\
\textrm{exp}(2*Z) * (1-r) & = & 1 + r \\
\textrm{exp}(2*Z) - r * \textrm{exp}(2*Z) & = & 1 + r \\
\textrm{exp}(2*Z) - 1 & = & r * \textrm{exp}(2*Z) + r \\
\textrm{exp}(2*Z) - 1 & = & r \left(\textrm{exp}(2*Z) + 1\right) \\
\frac{\textrm{exp}(2*Z) - 1}{\textrm{exp}(2*Z) + 1} & = & r \\
\end{aligned}$$
- Example (Altman 89-90): Pearson's $r$ for a study investigating the
association of basal metabolic rate with total energy expenditure
was calculated to be $0.7283$ in a study of $13$ women. Derive a 95%
confidence interval for $r$
$$Z = \frac{1}{2} \textrm{log}_\textrm{e} \left( \frac{1+r}{1-r} \right) = 0.9251$$
- The lower limit of a 95% CI for $Z$ is given by
$0.9251 + 1.96*\frac{1}{\sqrt{n-3}} = 0.3053$
- The upper limit is $0.9251 + 1.96*\frac{1}{\sqrt{n-3}} = 1.545$
- A 95% CI for the population correlation coefficient is given by
transforming these limits from the $Z$ scale back to the $r$ scale.
$$\frac{\textrm{exp}(2*0.3053) - 1}{\textrm{exp}(2*0.3053) + 1} \hspace{.5cm} \textrm{to} \hspace{.5cm} \frac{\textrm{exp}(2*1.545) - 1}{\textrm{exp}(2*1.545) + 1}$$
- Which gives a 95% CI from 0.30 to 0.91 for the population
correlation
quarto-executable-code-5450563D
```r
fisher.z <- function(r) {.5 * log((1+r)/(1-r))}
fisher.z.inv <- function(z) {(exp(2*z)-1) / (exp(2*z)+1)}
fisher.z.se <- function(n) {1/sqrt(n-3)}
fisher.z.inv(fisher.z(0.7283)) #Should be original value
fisher.z(0.7283)
fisher.z(0.7283)-1.96*fisher.z.se(13)
fisher.z(0.7283)+1.96*fisher.z.se(13)
fisher.z.inv(fisher.z(0.7283)+c(-1.96, 1.96)*fisher.z.se(13))
```
### Relationship to regression
- Pearson's correlation ($\rho$) is directly related to linear
regression
- Correlation treats $Y$ and $X$ symmetrically, but we can relate
- $E[Y | X]$ as a function of $X$
- $E[Y | X] = \beta_0 + \beta_1 X$
- $\beta_1 = \rho \frac{\sigma_Y}{\sigma_X}$
- $E[Y | X]$: mean $Y$ withing groups having equal $X$
- $\beta_1$: difference in mean $Y$ per 1 unit difference in $X$
- $\rho$: true correlation between $Y$ and $X$
- $\sigma_Y$: standard deviation of $Y$
- $\sigma_X$: standard deviation of $X$
- More interpretable formulation of $\rho$
- $\rho \approx \beta \sqrt{\frac{\textrm{Var}(X)}{\beta^2\textrm{Var}(X) + \textrm{Var}(Y | X = x)}}$
- $\beta$: slope between $Y$ and $X$
- $\textrm{Var}(X)$: variance of $X$ in the sample
- $\textrm{Var}(Y | X = x)$: variance of $Y$ in groups having the
same value of $X$ (the vertical spread of data)
- Correlation tends to increase in absolute value as
- The absolute value of the slope of the line increases
- The variance of data decreases within groups that share a common
value of $X$
- The variance of $X$ increases
- Scientific uses of correlation
- The slope between $X$ and $Y$ is of scientific interest
- The variance of $Y$ given $X=x$ is partly of scientific
interest, but can also be impacted by restricting sampling to
certain values of another variable
- The variance of height given age is less if sampling just
males than if sampling both sexes
- The variance of $X$ is often set by study design (which is
often not of scientific interest)
- Hypothesis tests for a nonzero correlation are exactly the same
as a test for a nonzero slope in classical linear regression
- The statistical significance of a given value of $\rho$ depends
only on the sample size
- Correlation is more of a statistical measure than a scientific
measure
## Regression on Geometric Means: Log transformations
- Geometric means of distributions are typically analyzed by using
linear regression on the log transformed outcome
- Used for inference under the following circumstances
- We are (scientifically) interested in multiplicative rather than
additive models
- We want to down-weight the impact of outliers
- The standard deviation of the response in a groups is
proportional to the mean
- "Error is $\pm 10\%$" instead of "Error is $\pm 10$"
### Interpretation of Parameters
- Linear regression on the log transformed $Y$
- Note that in statistics $\textrm{log}$ almost always refers the the
natural log ($\textrm{log}_e$)
- Model: $E[\textrm{log} Y_i | X_i] = \beta_0 + \beta_1 \times X_i$
- When $X_i = 0$: $E[\textrm{log} Y_i | X_i = 0] = \beta_0$
- When $X_i = x$:
$E[\textrm{log} Y_i | X_i = x] = \beta_0 + \beta_1 \times x$
- When $X_i = x+1$:
$E[\textrm{log} Y_i | X_i = x + 1] = \beta_0 + \beta_1 \times x + \beta_1$
- Restate the model as log link for geometric mean
- Model:
$\textrm{log GM}[Y_i | X_i] = \beta_0 + \beta_1 \times X_i$
- When $X_i = 0$: $\textrm{log GM}[Y_i | X_i = 0] = \beta_0$
- When $X_i = x$:
$\textrm{log GM}[Y_i | X_i = x] = \beta_0 + \beta_1 \times x$
- When $X_i = x+1$:
$\textrm{log GM}[Y_i | X_i = x + 1] = \beta_0 + \beta_1 \times x + \beta_1$
- Interpretation of parameters by back-transforming model
- Model:
$\textrm{GM}[Y_i | X_i] = e^{\beta_0} \times e^{\beta_1 \times X_i}$
- When $X_i = 0$: $\textrm{GM}[Y_i | X_i = 0] = e^{\beta_0}$
- When $X_i = x$:
$\textrm{GM}[Y_i | X_i = x] = e^{\beta_0} \times e^{\beta_1 \times x}$
- When $X_i = x+1$:
$\textrm{GM}[Y_i | X_i = x + 1] = e^{\beta_0} \times e^{\beta_1 \times x} \times e^{\beta_1}$
### Example: FEV with Height
- Question: How does FEV differ across height groups?
- Scientific justification for summarizing with geometric mean
- FEV is a volume
- Height is a linear dimension
- Each dimension of lung size is proportional to height
- Standard deviation is likely proportional to height
- Science: $FEV \propto Height^3$ or
$\sqrt[3]{FEV} \propto Height$
- Statistics: $\textrm{log}(FEV) \propto 3 \textrm{ log}(Height)$
- Science dictates model choice
- Statistical preference for (any) transformations of response
- May transform to equal variance across groups
- Homoscedasticity allows for easier inference
- Statistical preference for a log transformation
- Easier interpretation: multiplicative model
- Compare groups using ratios
### Log Transformed Predictors
- Interpretation of log predictor and log link for response
- Log link used to model the geometric mean of $Y$
- Exponentiated slope estimates ratio of geometric means across groups
- Log transformation of predictor compares groups with a $k$-fold
difference in there measured predictors
- Estimated ratio of geometric means
- $\exp (\textrm{log}(k) \times \beta_1) = k^{\beta_1}$
- Interpretation of model
- $\textrm{log GM}[FEV_i | log (height_i)] = -11.9 + 3.12 \times log(height_i)$
- Estimate ratio of geometric mean FEV for two groups differing by
$10\%$ in height (a 1.1 fold difference in height)
- To find: Exponentiate $1.1$ to the slope: $1.1^{3.12} = 1.35$
- "Group that is 10% taller is estimated to have a geometric mean
FEV that is 1.35 times higher (35% higher)"
- Why transform the predictor?
- Typically chosen according to whether the data are likely to
follow a straight line relationship
- Linearity necessary to predict the values of the parameter in
individual groups
- Linearity is not necessary to estimate existence of an
association
- Linearity is not necessary to estimate a first order trend in
the parameter across groups having the sampled distribution of
the predictor
- (Inference about these two questions will tend to be
conservative if linearity does not hold)
- Rarely do we know which transformation of the predictor provides
the best linear fit
- As always, there is significant danger in using the data to
estimate the best transformation to use
- If there is no association of any kind between the response and
the predictor, a linear fit (with a zero slope) is correct
- Trying to detect a transformation is thus an informal test for
an association (multiple testing, which inflates type I error
rate)
- It is best to choose transformations of the predictor on
scientific grounds
- In practice, it is often true that mean functions are well
approximated by a straight line over a small range of the data
- Example: In the FEV dataset, the logarithm of height is
approximately linear over the range of heights sampled (look at the
plot)
- We can often choose untransformed predictors when science would
suggest a non-linear association
- Can have major advantages in interpreting the results of the
analysis
- In the FEV example, it is far more natural to compare heights by
difference than by ratios
- We would rather characterize two children as differing by 4
inches in height rather than as the 44 inch child being 10%
taller than the 40 inch child
### Looking Ahead to Multivariable Models
- The relative importance of having the "true" transformation for a
predictor depends on the statistical role
- Predictor of Interest
- Effect Modifiers
- Confounders
- Precision variables
- Earlier I made a comment about treating these variables different in
an analysis
- Predictor of Interest (should be pre-specified)
- In general, don't worry about modeling the exact association
until you have even established an association (binary search)
- Need to have a prespecified model for the POI
- Searching for the best fit can inflate the type I error
- Make most accurate, precise inference about the presence of the
association first
- Exploratory models can suggest models for future analyses
- Effect Modifiers (difficult to do right anyway)
- Modeling of effect modifiers is invariably just a test for the
existence of an interaction
- We rarely have a lot of precision to answer questions in
subgroups of the data
- Patterns of interaction can be so complex that it is unlikely
that we will really capture the interactions across all
subgroups in a single model
- Typically, we would restrict future studies to analyses treating
subgroups separately
- Confounders (matters the most)
- It is important to have an appropriate model of the association
between the confounder and the response
- Failure to accurately model the confounder means that some
residual confounding will exist
- However, searching for the best model may inflate the type I
error for inference about the predictor of interest by
overstating the precision of the study
- Luckily, we rarely care about inference for the confounder, so
we are free to use inefficient means of adjustment (e.g. a
stratified analysis)
- Precision Variables
- When modeling precision variables, it is rarely worth the effort
to use the "best" transformation
- We usually capture the largest part of the added precision using
crude models
- We generally do not care about estimating association between
the response and the precision variable
- Most often, precision variables represent known effects on the
response
8.4 Comments
Age - Ch olesterol
As sumptions hold
Scenario 1
As sumptions hold
Scenario 3
I ncreasing variance
Scenario 2
Skewed variance
Point estimates \(\hat{\beta_0}\) and \(\hat{\beta_1}\) were the same for robust and classical standard error
Point estimates \(\hat{\textrm{se}}(\hat{\beta_0})\) and \(\hat{\textrm{se}}(\hat{\beta_1})\) were different for robust and classical standard error
In general, it is hard to predict the effect of using robust standard errors
These were just a a couple of examples and do not constitute a thorough examination of the impact of using robust standard errors
When the model assumptions hold, for these two scenarios, robust standard error estimates were larger than classical standard error estimates
While this is not unexpected, to formally evaluate the amount of increase would require a statistical simulation
Simulate many datasets where the Normality and constant variance assumption holds
For each dataset, estimate \(\hat{\textrm{se}}(\hat{\beta_0})\) and \(\hat{\textrm{se}}(\hat{\beta_1})\)
Summarize the results
Approach will be covered in a lab